Dataset: Red wine dataset consisting of 1599 variants of the Portuguese “Vinho Verde” wine. It includes physicochemical (input) and sensory (output) variables. This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
Overview: There are 13 variables within the original dataset. One of these is a dependent variable giving a subjective measure of quality based on experts' sensory review of the wine. The 11 main variables are independent physicochemical tests. These may be inter-related but are initially treated as individual measurements. The remaining variable, X, is a row index running from 1 to 1599.
See here for more information.
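Input variables (based on physicochemical tests): fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.
Output variable (based on sensory data): quality (a score between 0 and 10).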
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality_c
## Min. : 8.40 Min. :3.000 lower:744
## 1st Qu.: 9.50 1st Qu.:5.000 upper:855
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The summary table depicts descriptive statistics for all of the variables.
Quality is scored on a scale from 0 to 10. To plot it effectively, every score from 0 to 10 is included as a level so that the full scale is represented. This variable has been reformatted to a factor with 11 levels (from the original 6).
One categorical variable is generated to group quality into two categories: lower (<=5) and upper (>5). This will be used to explore any major differences in values between the two groups.
pH (potential of hydrogen) is on a scale of 0 to 14. All values here are acidic (<7), ranging from 2.74 to 4.01.
Alcohol is measured in percent (% by volume).
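The two recodings of quality described above could be sketched as follows. This is a minimal illustration on made-up scores, not the report's own code: the data frame `wine` and the column name `quality_f` are assumptions, while `quality_c` follows the name shown in the summary table.

```r
# Toy quality scores standing in for the real `quality` column (illustrative only).
wine <- data.frame(quality = c(3, 4, 5, 5, 6, 6, 7, 8))

# Factor with all 11 possible scores (0-10), so unused scores still appear on plots.
# `quality_f` is a hypothetical column name.
wine$quality_f <- factor(wine$quality, levels = 0:10)

# Two-level grouping: lower (<= 5) and upper (> 5).
wine$quality_c <- factor(ifelse(wine$quality <= 5, "lower", "upper"),
                         levels = c("lower", "upper"))

nlevels(wine$quality_f)
table(wine$quality_c)
```

Keeping the numeric `quality` column alongside the factor versions leaves it available for correlations later on.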
The other variables are all numeric data types.
There are no missing values within the dataset.
Quality is between 3 and 8 despite the scoring range being 0-10. The majority of values are 5 and 6. The right tail is heavier, with more high scores at 7 than low scores. Given the poor sampling of scores outside of 5 and 6, the quality measure will be split into two categories, upper (>5) and lower (<=5).
Alcohol also has a skewed distribution, with most values located between 9% and 13%. A small histogram gives a quicker impression of the distribution of the data than reading the descriptive statistics.
To investigate further variables, a group of 9 measurements is placed into histograms to understand their distributions. Of these, pH, fixed acidity and volatile acidity have roughly normal distributions. These could be investigated further using quantile-quantile plots.
Residual sugar and chlorides both appear to have highly skewed, over-dispersed data. These could be explored using a transformation to investigate the distribution of values further.
Citric acid has two spikes, one occurring at 0 and one around 0.5. Bin size could be changed to investigate whether there is some artifact in the data related to rounding.
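As a complement to changing the bin size, exact recorded values can be counted directly, which does not depend on binning at all. The sketch below uses made-up citric acid values (an assumption for illustration, not the real data) to show the idea.

```r
# Toy citric acid values (illustrative only), with repeats at 0 and 0.49.
citric <- c(0, 0, 0, 0, 0.02, 0.10, 0.24, 0.26, 0.49, 0.49, 0.49, 0.66)

# Count how often each exact recorded value occurs, most frequent first.
counts <- sort(table(citric), decreasing = TRUE)
head(counts, 3)
```

If a spike in the histogram corresponds to a single repeated value rather than a spread of nearby values, rounding or a default entry is a plausible explanation worth checking.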
Density appears relatively normal but with occasional spikes. These should be checked for rounding issues.
Free sulfur dioxide and total sulfur dioxide have skewed distributions, with a sparse population of values towards the maximum.
Two plots detailing pH value.
Using these two plots, pH can be examined in greater detail. pH appears to have light tails, as displayed by the overall sigmoidal shape of the points. It displays very fine-grained clustering, which is related to rounding of values, and it shows an overall skew to the right.
Towards the upper right of the quantile-quantile plot against a standard normal distribution, the line crosses the 95% confidence intervals.
Density also shows light tails and a skew to the right, leading to the fit for a normal distribution existing outside of the 95% confidence intervals.
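A quantile-quantile plot of this kind can be sketched in base R as below. The data are simulated stand-ins for pH (an assumption, rounded to two decimals to mimic the clustering noted above), and base R's `qqnorm()` does not draw the 95% confidence band; a function such as `car::qqPlot()` could add one.

```r
set.seed(1)
# Toy stand-in for the pH column (illustrative only): normal draws rounded to
# two decimals to mimic the measurement precision of the real data.
ph <- round(rnorm(200, mean = 3.3, sd = 0.15), 2)

# Sample quantiles plotted against theoretical standard-normal quantiles.
qq <- qqnorm(ph, main = "Normal Q-Q plot: pH (toy data)")

# Reference line through the first and third quartiles.
qqline(ph)
```

Points that bend away from the reference line at either end indicate tails that are lighter or heavier than a normal distribution would give.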
To explore over-dispersed variables, data transformations can be applied. The above three plots show residual sugar in its original form, followed by a square root and a log10 transformation. The data transformations give a better representation of the distribution of values. They still have long tails towards the right.
Other over-dispersed variables are explored in the above plot, and each is matched to a data transformation that better represents the distribution of data values. Chlorides shows a roughly normal distribution with a long tail to the right. It is centered nicely after a log10 transformation.
Citric acid still has a large count of values close to 0. The distribution is best represented using a square root transformation. This does not bring the data closer to a normal distribution but represents the distribution of values across the range.
Total sulfur dioxide is transformed using log10, giving a wide, roughly normal distribution.
For free sulfur dioxide, a log10 transformation pushes too many values to the right-hand side, so even with a skew, the square root gives a better distribution of values.
Each of these transformations can be used when comparing these variables to other variables.
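The comparison of transformations can be sketched as follows. The values are simulated right-skewed data standing in for residual sugar (an assumption for illustration), and the `skewness()` helper is a hypothetical convenience function, not part of base R.

```r
set.seed(42)
# Toy right-skewed stand-in for residual sugar (illustrative only).
sugar <- rlnorm(500, meanlog = log(2.2), sdlog = 0.5)

# Simple moment-based skewness; positive values mean a longer right tail.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Skewness of the original, square-root and log10 versions.
skews <- sapply(list(raw = sugar, sqrt = sqrt(sugar), log10 = log10(sugar)),
                skewness)
round(skews, 2)

# Side-by-side histograms of the three versions.
op <- par(mfrow = c(1, 3))
hist(sugar, main = "Original", xlab = "Residual sugar")
hist(sqrt(sugar), main = "Square root", xlab = "sqrt(residual sugar)")
hist(log10(sugar), main = "log10", xlab = "log10(residual sugar)")
par(op)
```

A skewness closer to zero after transformation is one rough indication that the transformed scale shows the bulk of the values more evenly.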
The dataset is in a tidy format. Each observation corresponds to a series of variables.
The dataset consists of one key dependent variable, quality. This is based on the subjective sensory assessment of experts. This should result in a value between 0 and 10. Within this dataset, the majority of the samples have scores of 5 or 6.
The other eleven variables are all measurements. It is assumed that these are independent, although there may be relationships between some of these eleven variables.
The main feature of interest is the dependent variable quality.
Without further exploratory work, and without applying any domain knowledge about what is most important, it cannot be assessed which of the 11 measurement variables matters most.
One categorical variable is created based on the dependent quality variable. This splits quality into two categories, upper and lower. The purpose of this is to represent better (upper) and worse (lower) wines. Quality is split in two due to the limited distribution of values outside of 5 and 6. The idea is to use this new variable to see if there are any relationships in other variables that separate better from worse wines.
pH and density appear roughly normal in their histograms, but the Q-Q plots have shown they sit outside of an idealised normal distribution.
Rounding of inputs appears to cause minor clustering within the dataset. This may relate to the precision of the tools used to assess the physicochemical properties of each wine.
Citric acid has a large number of values close to zero. This should be further investigated to see if this is a data quality issue or a true signal.
Several variables have over-dispersed distributions with long tails to the right. Using either square root or log10 transformations can help better represent the distribution of these values. Each of these has been investigated to identify which transformation should be applied for future plots.
This section will begin by assessing whether there are any variables with a strong difference between upper and lower wine quality reviews.
This section will also check if any of the measurement variables are related.
Frequency polygons show the distribution of values, now split by groups of the dependent variable quality. This can help show whether there are trends between any of the input variables and the output variable.
In the above figure, a large count of wines with lower quality scores is associated with lower alcohol values.
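The counts behind a grouped frequency polygon can be sketched in base R as below. The report's own figures appear to be drawn with ggplot2 (the `geom_smooth()` message suggests so), where `geom_freqpoly()` would be the natural choice; base R is used here only so that the sketch runs without extra packages. The data values and bin edges are assumptions for illustration.

```r
# Toy data (illustrative only): alcohol in % and the two-level quality grouping.
wine <- data.frame(
  alcohol   = c(9.2, 9.4, 9.5, 9.8, 10.1, 10.4, 10.9, 11.2, 11.6, 12.3),
  quality_c = factor(c("lower", "lower", "lower", "lower", "upper",
                       "lower", "upper", "upper", "upper", "upper"),
                     levels = c("lower", "upper"))
)

# Shared bins so both groups are counted on the same grid.
breaks <- seq(9, 13, by = 1)
bins   <- cut(wine$alcohol, breaks = breaks, include.lowest = TRUE)

# Rows are bins, columns are quality groups.
counts <- table(bins, wine$quality_c)
counts

# One line per group, drawn through the bin midpoints.
mids <- head(breaks, -1) + diff(breaks) / 2
matplot(mids, unclass(counts), type = "b", pch = 1, lty = 1,
        xlab = "Alcohol (% by volume)", ylab = "Count")
legend("topright", legend = colnames(counts), col = 1:2, lty = 1)
```

Because the lines show raw counts, a larger group will sit higher overall; dividing each column by its total would give a comparison of shapes instead.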
By plotting the remaining variables in a similar way, it is possible to quickly identify any variables that appear to have some relationship to the upper or lower quality group. The first observation is that many of the variables show little difference between the frequency polygons for upper and lower. Some do show a difference, but it does not appear to be large or very obvious.
Volatile acidity is centered further to the left for the upper group: its frequency polygon sits at lower values than that of the lower group.
Total sulfur dioxide has none of its highest values in the upper group.
Both free sulfur dioxide and total sulfur dioxide show two peaks for the upper group (this may relate to count, as this is not a density plot).
Citric acid has a spike in its higher values for the upper group.
These issues will be worth investigating further.
The pair plot gives all combinations of bivariate analysis. Its main limitation is that it is not created using the data transformations previously identified. It acts as a useful way to look up pairs of variables to investigate further.
As previously shown by the bivariate frequency polygon plots, very few of the variables differ significantly between the upper and lower categories of quality, apart from alcohol. This can be viewed in the box plots and paired histograms. Alcohol has a correlation of 0.48 with quality on this plot. Volatile acidity has -0.39 and sulphates has 0.25. These are the variables with the highest correlation to the quality variable.
Free sulfur dioxide and total sulfur dioxide have a correlation of 0.67, but the two variables have little correlation with other variables.
Density appears to have some weak correlation with a few variables, such as residual sugar and fixed acidity. This could be a candidate for multivariate analysis.
Chlorides and residual sugar should be checked, as these are both highly dispersed variables, so it is difficult to see any correlation in this pair plot.
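Correlations like those quoted from the pair plot can also be computed directly and ranked. The sketch below uses a small made-up data frame (an assumption; the column names follow the summary table) and Pearson correlation, which is the default of `cor()`.

```r
# Toy data (illustrative only); column names follow the summary table.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8, 9.5),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.52, 0.40, 0.35, 0.30, 0.66),
  sulphates        = c(0.56, 0.68, 0.58, 0.60, 0.75, 0.80, 0.72, 0.54),
  quality          = c(5, 5, 5, 6, 6, 7, 7, 5)
)

# Pearson correlation matrix for all numeric columns.
cm <- cor(wine)

# Correlation of each measurement with quality, strongest first.
with_quality <- cm[setdiff(rownames(cm), "quality"), "quality"]
with_quality[order(abs(with_quality), decreasing = TRUE)]
```

Ranking by absolute value keeps strong negative relationships, such as the one reported for volatile acidity, near the top of the list.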
Total vs. free sulfur dioxide is plotted with both variables on a log10 scale to highlight the correlation between the two variables. There are a handful of outliers, but this appears to show a relationship.
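The relationship on the log10 scale can be sketched as follows. The values are made up for illustration, and a straight-line `lm()` fit stands in for the smoother used in the report's figure; it is a simplification, not the same method.

```r
# Toy data (illustrative only); free sulfur dioxide never exceeds the total.
wine <- data.frame(
  free.sulfur.dioxide  = c(5, 9, 15, 17, 25, 32, 40, 52),
  total.sulfur.dioxide = c(14, 30, 40, 60, 67, 100, 115, 145)
)

# Correlation between the two variables on the log10 scale.
r_log <- cor(log10(wine$free.sulfur.dioxide), log10(wine$total.sulfur.dioxide))
r_log

# Straight-line fit on the log10-log10 scale.
fit <- lm(log10(total.sulfur.dioxide) ~ log10(free.sulfur.dioxide), data = wine)
coef(fit)

# Scatter plot with both axes on a log10 scale, plus the fitted line.
plot(wine$free.sulfur.dioxide, wine$total.sulfur.dioxide, log = "xy",
     xlab = "Free sulfur dioxide", ylab = "Total sulfur dioxide")
abline(fit)
```

A slope near 1 on the log10-log10 scale would be consistent with a roughly proportional relationship between the two measurements.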
Chlorides and residual sugar are compared with a log10 transform, but there still appears to be limited correlation between these two variables.
Overall there are few strong correlations between variables in this dataset. The most promising line of investigation seems to be alcohol and its relationship to quality.
Two other variables (Volatile acidity and sulphates) have a weak relationship to the feature of interest.
Few strong relationships exist. Density appears to have associations with multiple variables and can be investigated further through multivariate analysis.
A number of plots have outliers. It would be interesting to know if there are clusters within the dataset or any other types of structure that cannot be observed through bivariate analysis.
Total and free sulfur dioxide have the strongest relationship found. This would make sense, as it is likely that free sulfur dioxide has a proportional relationship to the total amount of sulfur dioxide in a sample.
The
The plot
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236. Available at: [@Elsevier](http://dx.doi.org/10.1016/j.dss.2009.05.016)